Automatic Extraction of Data Points and Text Blocks from 2-Dimensional Plots in Digital Documents

نویسندگان

  • Saurabh Kataria
  • William Browuer
  • Prasenjit Mitra
  • C. Lee Giles
چکیده

Two dimensional plots (2-D) in digital documents on the web are an important source of information that is largely under-utilized. In this paper, we outline how data and text can be extracted automatically from these 2-D plots, thus eliminating a time consuming manual process. Our information extraction algorithm identifies the axes of the figures, extracts text blocks like axes-labels and legends and identifies data points in the figure. It also extracts the units appearing in the axes labels and segments the legends to identify the different lines in the legend, the different symbols and their associated text explanations. Our algorithm also performs the challenging task of separating out overlapping text and data points effectively. Our experiments indicate that these techniques are computationally efficient and provide acceptable accuracy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents

Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and ex...

متن کامل

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

متن کامل

Biogeography-Based Optimization Algorithm for Automatic Extractive Text Summarization

    Given the increasing number of documents, sites, online sources, and the users’ desire to quickly access information, automatic textual summarization has caught the attention of many researchers in this field. Researchers have presented different methods for text summarization as well as a useful summary of those texts including relevant document sentences. This study select...

متن کامل

Automatic Documents Annotation by Keyphrase Extraction in Digital Libraries using Taxonomy

Keyphrases are useful for variety of purposes including: text clustering, classification, content-based retrieval, and automatic text summarization. A small amount of documents have author-assigned keyphrases. Manual assignment of the keyphrases to existing documents is a tedious task, therefore, automatic keyphrase extraction has been extensively used to organize documents. Existing automatic ...

متن کامل

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008